(57) A method and apparatus are disclosed for allocating
functional units in a multithreaded very large instruction
word (VLIW) processor. The present invention combines the
techniques of conventional very long instruction word (VLIW)
architectures and conventional multithreaded architectures
to reduce execution time within an individual program, as
well as across a workload. The present invention utilizes
instruction packet splitting to recover some efficiency lost
with conventional multithreaded architectures. Instruction
packet splitting allows an instruction bundle to be partially
issued in one cycle, with the remainder of the bundle issued
during a subsequent cycle. There are times, however, when
instruction packets cannot be split without violating the
semantics of the instruction packet assembled by the
compiler. A packet split identification bit is disclosed that
allows hardware to efficiently determine when it is
permissible to split an instruction packet. The split bit
informs the hardware when splitting is prohibited. The
allocation hardware assigns as many instructions from each
packet as will fit on the available functional units, rather
than allocating all instructions in an instruction packet at
one time, provided the split bit has not been set. Those
instructions that cannot be allocated to a functional unit
are retained in a ready-to-run register. On subsequent
cycles, instruction packets in which all instructions have
been issued to functional units are updated from their
thread's instruction stream, while instruction packets with
instructions that have been held are retained. The functional
unit allocation logic can then assign instructions from the
newly-loaded instruction packets as well as instructions that
were not issued from the retained instruction packets.
Description

Cross-Reference to Related Applications 

[0001] The present invention is related to United 
States Patent Application entitled "Method and Appara- 
tus for Allocating Functional Units in a Multithreaded 
Very Large Instruction Word (VLIW) Processor," Attor- 
ney Docket Number (Berenbaum 7-2-3-3); United 
States Patent Application entitled "Method and Appara- 
tus for Releasing Functional Units in a Multithreaded 
Very Large Instruction Word (VLIW) Processor," Attor- 
ney Docket Number (Berenbaum 8-3-4-4); and United 
States Patent Application entitled "Method and Appara- 
tus for Splitting Packets in a Multithreaded Very Large 
Instruction Word (VLIW) Processor," Attorney Docket 
Number (Berenbaum 9-4-5-5), each filed contempora- 
neously herewith, assigned to the assignee of the 
present invention and incorporated by reference herein. 

Field of the Invention 

[0002] The present invention relates generally to mul- 
tithreaded processors, and, more particularly, to a meth- 
od and apparatus for splitting packets in such multi- 
threaded processors. 

Background of the Invention 

[0003] Computer architecture designs attempt to 
complete workloads more quickly. A number of architec- 
ture designs have been proposed or suggested for ex- 
ploiting program parallelism. Generally, an architecture 
that can issue more than one operation at a time is ca- 
pable of executing a program faster than an architecture 
that can only issue one operation at a time. Most recent 
advances in computer architecture have been directed 
towards methods of issuing more than one operation at 
a time, thereby speeding up the operation of programs.
FIG. 1 illustrates a conventional microprocessor
architecture 100. Specifically, the microprocessor 100
includes a program counter (PC) 110, a register set 120
and a number of functional units (FUs) 130-N. The
redundant functional units (FUs) 130-1 through 130-N
provide the illustrative microprocessor architecture 100
with sufficient hardware resources to perform a
corresponding number of operations in parallel.
[0004] An architecture that exploits parallelism in a 
program issues operands to more than one functional 
unit at a time to speed up the program execution. A 
number of architectures have been proposed or sug- 
gested with a parallel architecture, including supersca- 
lar processors, very long instruction word (VLIW) proc- 
essors and multithreaded processors, each discussed 
below in conjunction with FIGS. 2, 4 and 5, respectively. 
Generally, a superscalar processor utilizes hardware at
run-time to dynamically determine if a number of
operations from a single instruction stream are independent,
and if so, the processor executes the instructions using
parallel arithmetic and logic units (ALUs). Two
instructions are said to be independent if none of the source
operands are dependent on the destination operands of
any instruction that precedes them. A very long
instruction word (VLIW) processor evaluates the instructions
during compilation and groups the operations
appropriately, for parallel execution, based on dependency
information. A multithreaded processor, on the other
hand, executes more than one instruction stream in
parallel, rather than attempting to exploit parallelism within
a single instruction stream.

[0005] A superscalar processor architecture 200,
shown in FIG. 2, has a number of functional units that
operate independently, in the event each is provided
with valid data. For example, as shown in FIG. 2, the
superscalar processor 200 has three functional units
embodied as arithmetic and logic units (ALUs) 230-N,
each of which can compute a result at the same time.
The superscalar processor 200 includes a front-end
section 208 having an instruction fetch block 210, an
instruction decode block 215, and an instruction
sequencing unit 220 (issue block). The instruction fetch block
210 obtains instructions from an input queue 205 of a
single threaded instruction stream. The instruction
sequencing unit 220 identifies independent instructions
that can be executed simultaneously in the available
arithmetic and logic units (ALUs) 230-N, in a known
manner. The refine block 250 allows the instructions to
complete, and also provides buffering and reordering for
writing results back to the register set 240.
[0006] In the program fragment 310 shown in FIG. 3,
instructions in locations L1, L2 and L3 are independent,
in that none of the source operands in instructions L2
and L3 are dependent on the destination operands of
any instruction that precedes them. When the program
counter (PC) is set to location L1, the instruction
sequencing unit 220 will look ahead in the instruction
stream and detect that the instructions at L2 and L3 are
independent, and thus all three can be issued
simultaneously to the three available functional units 230-N. For
a more detailed discussion of superscalar processors,
see, for example, James E. Smith and Gurindar S. Sohi,
"The Microarchitecture of Superscalar Processors,"
Proc. of the IEEE (Dec. 1995), incorporated by
reference herein.
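
By way of a non-limiting illustration, the independence test
described above can be sketched in Python. The `Instr` tuple,
the helper names and the register names are assumptions made
for this sketch only; the actual contents of FIG. 3 are not
reproduced here, and real issue logic performs this check in
hardware rather than in software.

```python
from typing import NamedTuple, Sequence, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction: one destination register, some sources."""
    dest: str
    srcs: Tuple[str, ...]


def is_independent(instrs: Sequence[Instr], i: int) -> bool:
    """True if no source of instrs[i] is the destination of an earlier instruction."""
    earlier_dests = {ins.dest for ins in instrs[:i]}
    return not any(src in earlier_dests for src in instrs[i].srcs)


def all_independent(instrs: Sequence[Instr]) -> bool:
    """True if every instruction in the sequence passes the independence test."""
    return all(is_independent(instrs, i) for i in range(len(instrs)))


# Hypothetical three-instruction fragment (not the fragment of FIG. 3).
fragment = [
    Instr("R1", ("R2", "R3")),
    Instr("R4", ("R5", "R6")),
    Instr("R7", ("R8", "R9")),
]
# Same fragment, but the third instruction now reads R1, written by the first.
dependent = fragment[:2] + [Instr("R7", ("R1", "R9"))]
```

In this sketch, `all_independent(fragment)` holds, so all three
instructions could issue together, whereas `dependent` fails the
test because its last instruction reads a register written
earlier in the sequence.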

[0007] As previously indicated, a very long instruction
word (VLIW) processor 400, shown in FIG. 4, relies on
software to detect data parallelism at compile time from
a single instruction stream, rather than using hardware
to dynamically detect parallelism at run time. A VLIW
compiler, when presented with the source code that was
used to generate the code fragment 310 in FIG. 3, would
detect the instruction independence and construct a
single, very long instruction comprised of all three
operations. At run time, the issue logic of the processor 400
would issue this wide instruction in one cycle, directing
data to all available functional units 430-N. As shown in
FIG. 4, the very long instruction word (VLIW) processor
400 includes an integrated fetch/decode block 420 that
obtains the previously grouped instructions 410 from
memory. For a more detailed discussion of very long
instruction word (VLIW) processors, see, for example,
Burton J. Smith, "Architecture and Applications of the
HEP Multiprocessor Computer System," SPIE Real
Time Signal Processing IV, 241-248 (1981), incorporated
by reference herein.

[0008] One variety of VLIW processors, for example, 
represented by the Multiflow architecture, discussed in 
Robert P. Colwell et al., "A VLIW Architecture for a Trace
Scheduling Compiler," IEEE Transactions on Computers
(August 1988), uses a fixed-width instruction, in
which predefined fields direct data to all functional units 
430-N at once. When all operations specified in the wide 
instruction are completed, the processor issues a new, 
multi-operation instruction. Some more recent VLIW 
processors, such as the C6x processor commercially 
available from Texas Instruments, of Dallas, TX and the 
EPIC IA-64 processor commercially available from Intel 
Corp, of Santa Clara, CA, instead use a variable-length 
instruction packet, which contains one or more opera- 
tions bundled together. 

[0009] A multithreaded processor 500, shown in FIG. 
5, gains performance improvements by executing more 
than one instruction stream in parallel, rather than at- 
tempting to exploit parallelism within a single instruction 
stream. The multithreaded processor 500 shown in FIG. 
5 includes a program counter 510-N, a register set 
520-N and a functional unit 530-N, each dedicated to a 
corresponding instruction stream N. Alternate imple- 
mentations of the multithreaded processor 500 have uti- 
lized a single functional unit 530, with several register 
sets 520-N and program counters 510-N. Such alternate 
multithreaded processors 500 are designed in such a 
way that the processor 500 can switch instruction issue 
from one program counter/register set 510-N/520-N to 
another program counter/register set 510-N/520-N in 
one or two cycles. A long latency instruction, such as a 
LOAD instruction, can thus be overlapped with shorter 
operations from other instruction streams. The TERA 
MTA architecture, commercially available from Tera 
Computer Company, of Seattle, WA, is an example of 
this type. 

[0010] An extension of the multithreaded architecture 
500, referred to as Simultaneous Multithreading, com- 
bines the superscalar architecture, discussed above in 
conjunction with FIG. 2, with the multithreaded designs, 
discussed above in conjunction with FIG. 5. For a
detailed discussion of Simultaneous Multithreading
techniques, see, for example, Dean Tullsen et al.,
"Simultaneous Multithreading: Maximizing On-Chip Parallelism,"
Proc. of the 22nd Annual Int'l Symposium on Computer
Architecture, 392-403 (Santa Margherita Ligure, Italy,
June 1995), incorporated by reference herein.
Generally, in a Simultaneous Multithreading architecture, there
is a pool of functional units, any number of which may
be dynamically assigned to an instruction which can
issue from any one of a number of program
counter/register set structures. By sharing the functional units
among a number of program threads, the Simultaneous
Multithreading architecture can make more efficient use
of hardware than that shown in FIG. 5.
[0011] While the combined approach of the Simultaneous
Multithreading architecture provides improved
efficiency over the individual approaches of the
superscalar architecture or the multithreaded architecture,
Simultaneous Multithreading architectures still require
elaborate issue logic to dynamically examine instruction
streams in order to detect potential parallelism. In
addition, when an operation takes multiple cycles, the
instruction issue logic may stall, because there is no other
source of operations available. Conventional
multithreaded processors issue instructions from a set of
instructions simultaneously, with the functional units
designed to accommodate the widest potential issue. A
need therefore exists for a multithreaded processor
architecture that does not require a dynamic
determination of whether or not two instruction streams are
independent. A further need exists for a multithreaded
architecture that provides simultaneous multithreading. Yet
another need exists for a method and apparatus that
improve the utilization of processor resources for each
cycle.

Summary of the Invention 


[0012] Generally, a method and apparatus are
disclosed for allocating functional units in a multithreaded
very large instruction word (VLIW) processor. The
present invention combines the techniques of
conventional very long instruction word (VLIW) architectures
and conventional multithreaded architectures. The
combined architecture of the present invention reduces
execution time within an individual program, as well as
across a workload. The present invention utilizes
instruction packet splitting to recover some efficiency lost
with conventional multithreaded architectures.
Instruction packet splitting allows an instruction bundle to be
partially issued in one cycle, with the remainder of the
bundle issued during a subsequent cycle. Thus, the
present invention provides greater utilization of
hardware resources (such as the functional units) and a
lower elapsed time across a workload comprising multiple
threads.

[0013] There are times when instruction packets
cannot be split without violating the semantics of the
instruction packet assembled by the compiler. In particular, the
input value of a register is assumed to be the same for
instructions in a packet, even if the register is modified
by one of the instructions in the packet. If the packet is
split, and a source register for one of the instructions in
the second part of the packet is modified by one of the
instructions in the first part of the packet, then the
compiler semantics will be violated.
[0014] Thus, the present invention utilizes a packet
split identification bit that allows hardware to efficiently 
determine when it is permissible to split an instruction 
packet. Instruction packet splitting increases throughput 
across all instruction threads, and reduces the number 
of cycles that the functional units rest idle. The allocation 
hardware of the present invention assigns as many in- 
structions from each packet as will fit on the available 
functional units, rather than allocating all instructions in 
an instruction packet at one time, provided the packet 
split identification bit has not been set. Those
instructions that cannot be allocated to a functional unit are
retained in a ready-to-run register. On subsequent
cycles, instruction packets in which all instructions have
been issued to functional units are updated from their 
thread's instruction stream, while instruction packets 
with instructions that have been held are retained. The 
functional unit allocation logic can then assign instruc- 
tions from the newly-loaded instruction packets as well 
as instructions that were not issued from the retained 
instruction packets. 

[0015] The present invention utilizes a compiler to de- 
tect parallelism in a multithreaded processor architec- 
ture. Thus, a multithreaded VLIW architecture is dis- 
closed that exploits program parallelism by issuing mul- 
tiple instructions, in a similar manner to single threaded 
VLIW processors, from a single program sequencer, 
and also supporting multiple program sequencers, as in 
simultaneous multithreading but with reduced complex- 
ity in the issue logic, since a dynamic determination is 
not required. 

[0016] A more complete understanding of the present 
invention, as well as further features and advantages of 
the present invention, will be obtained by reference to 
the following detailed description and drawings. 

Brief Description of the Drawings 

[0017] 

FIG. 1 illustrates a conventional generalized
microprocessor architecture;

FIG. 2 is a schematic block diagram of a conventional
superscalar processor architecture;

FIG. 3 is a program fragment illustrating the
independence of operations;

FIG. 4 is a schematic block diagram of a conventional
very long instruction word (VLIW) processor architecture;

FIG. 5 is a schematic block diagram of a conventional
multithreaded processor;

FIG. 6 illustrates a multithreaded VLIW processor
in accordance with the present invention;

FIG. 7A illustrates a conventional pipeline for a
multithreaded processor;

FIG. 7B illustrates a pipeline for a multithreaded
processor in accordance with the present invention;

FIG. 8 is a schematic block diagram of an
implementation of the allocate stage of FIG. 7B;

FIG. 9 illustrates the execution of three threads TA-TC
for a conventional multithreaded implementation,
where threads B and C have higher priority
than thread A;

FIGS. 10A and 10B illustrate the operation of
instruction packet splitting in accordance with the
present invention;

FIG. 11 is a program fragment that may not be split
in accordance with the present invention;

FIG. 12 is a program fragment that may be split in
accordance with the present invention;

FIG. 13 illustrates a packet corresponding to the
program fragment of FIG. 12 where the instruction
splitting bit has been set;

FIG. 14 is a program fragment that may not be split
in accordance with the present invention; and

FIG. 15 illustrates a packet corresponding to the
program fragment of FIG. 14 where the instruction
splitting bit has not been set.

Detailed Description 

[0018] FIG. 6 illustrates a Multithreaded VLIW
processor 600 in accordance with the present invention. As
shown in FIG. 6, there are three instruction threads,
namely, thread A (TA), thread B (TB) and thread C (TC),
each operating at instruction number n. In addition, the
illustrative Multithreaded VLIW processor 600 includes
nine functional units 620-1 through 620-9, which can be
allocated independently to any thread TA-TC. Since the
number of instructions across the illustrative three
threads TA-TC is nine and the illustrative number of
available functional units 620 is also nine, each of
the three threads TA-TC can issue its instruction packet
in one cycle and move on to instruction n+1 on the
subsequent cycle.
[0019] It is noted that there is generally a one-to-one
correspondence between instructions and the operation
specified thereby. Thus, such terms are used
interchangeably herein. It is further noted that in the situation
where an instruction specifies multiple operations, it is
assumed that the multithreaded VLIW processor 600
includes one or more multiple-operation functional units
620 to execute the instruction specifying multiple
operations. An example of an architecture where instructions
specifying multiple operations may be processed is a
complex instruction set computer (CISC).
[0020] The present invention allocates instructions to
functional units to issue multiple VLIW instructions to
multiple functional units in the same cycle. The
allocation mechanism of the present invention occupies a
pipeline stage just before arguments are dispatched to
functional units. Thus, FIG. 7A illustrates a conventional
pipeline 700 comprised of a fetch stage 710, where a
packet is obtained from memory, a decode stage 720,
where the required functional units and registers are
identified for the fetched instructions, and an execute
stage 730, where the specified operations are
performed and the results are processed.
[0021] Thus, in a conventional VLIW architecture, a 
packet containing up to K instructions is fetched each 
cycle (in the fetch stage 710). Up to K instructions are
decoded in the decode stage 720 and sent to (up to) K 
functional units (FUs). The registers corresponding to 
the instructions are read, the functional units operate on 
them and the results are written back to registers in the 
execute stage 730. It is assumed that up to three regis- 
ters can be read and up to one register can be written 
per functional unit. 

[0022] FIG. 7B illustrates a pipeline 750 in accord- 
ance with the present invention, where an allocate stage 
780, discussed further below in conjunction with FIG. 8, 
is added for the implementation of multithreaded VLIW 
processors. Generally, the allocate stage 780 deter- 
mines how to group the operations together to maximize 
efficiency. Thus, the pipeline 750 includes a fetch stage 
760, where up to N packets are obtained from memory, 
a decode stage 770, where the functional units and reg- 
isters are identified for the fetched instructions (up to 
N*K instructions), an allocate stage 780, where the ap- 
propriate instructions are selected and assigned to the 
FUs, and an execute stage 790, where the specified op- 
erations are performed and the results are processed. 
[0023] In the multithreaded VLIW processor 600 of 
the present invention, up to N threads are supported in 
hardware. N thread contexts exist and contain all pos- 
sible registers of a single thread and all status informa- 
tion required. A multithreaded VLIW processor 600 has 
M functional units, where M is greater than or equal to 
K. The modified pipeline stage 750, shown in FIG. 7B, 
works in the following manner. In each cycle, up to N 
packets (each containing up to K instructions) are 
fetched at the fetch stage 760. The decode stage 770 
decodes up to N*K instructions and determines their re- 
quirements and the registers read and written. The al- 
locate stage 780 selects M out of (up to) N*K instructions 
and forwards them to the M functional units. It is as- 
sumed that each functional unit can read up to 3 regis- 
ters and write one register. In the execute stage 790, up 
to M functional units read up to 3*M registers and write 
up to M registers. 

[0024] The allocate stage 780 selects for execution 
the appropriate M instructions from the (up to) N * K in- 
structions that were fetched and decoded at stages 760 
and 770. The criteria for selection are thread priority or 
resource availability or both. Under the thread priority 
criteria, different threads can have different priorities. 
The allocate stage 780 selects and forwards the packets 
(or instructions from packets) for execution belonging to 
the thread with the highest priority according to the pri- 
ority policy implemented. A multitude of priority policies 
can be implemented. For example, a priority policy for 
a multithreaded VLIW processor supporting N contexts 
(N hardware threads) can have N priority levels. The 
highest priority thread in the processor is allocated
before any other thread. Among threads with equal priority,
the thread that waited the longest for allocation is
preferred.

[0025] Under the resource availability criteria, a packet
(having up to K instructions) can be allocated only if
the resources (functional units) required by the packet 
are available for the next cycle. Functional units report 
their availability to the allocate stage 780. 
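
By way of a non-limiting illustration, the two selection
criteria can be combined as in the following Python sketch,
which allocates whole packets only. The `Thread` record, the
convention that a larger number means a higher priority, and
the simplification that all functional units are
interchangeable are assumptions of this sketch and are not
taken from the disclosure. The packet sizes in the example
(three, two and four operations) are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Thread:
    name: str
    priority: int      # assumption: larger value means higher priority
    waited: int        # cycles this thread has waited for allocation
    packet_size: int   # instructions in the current packet (up to K)


def allocate_whole_packets(threads: List[Thread],
                           free_units: int) -> Tuple[List[str], int]:
    """Issue packets in priority order; a packet issues only if it fits entirely.

    Ties in priority go to the thread that has waited longest.
    Returns the names of the issued threads and the units left idle.
    """
    order = sorted(threads, key=lambda t: (-t.priority, -t.waited))
    issued = []
    for thread in order:
        if thread.packet_size <= free_units:
            issued.append(thread.name)
            free_units -= thread.packet_size
    return issued, free_units


threads = [
    Thread("A", priority=0, waited=0, packet_size=4),
    Thread("B", priority=1, waited=0, packet_size=3),
    Thread("C", priority=1, waited=0, packet_size=2),
]
issued, idle = allocate_whole_packets(threads, free_units=7)
```

With seven units, the packets of threads B and C issue and the
four-operation packet of thread A does not fit in the two
remaining units, so those two units are left idle for the cycle.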
[0026] FIG. 8 illustrates a schematic block diagram of 
an implementation of the allocate stage 780. As shown 
in FIG. 8, the hardware needed to implement the allo- 
cate stage 780 includes a priority encoder 810 and two 
crossbar switches 820, 830. Generally, the priority
encoder 810 examines the state of the multiple operations
in each thread, as well as the state of the available
functional units. The priority encoder 810 selects the packets
that are going to execute and sets up the first crossbar
switch 820 so that the appropriate register contents are
transferred to the functional units at the beginning of the
next cycle. The output of the priority encoder 810
configures the first crossbar switch 820 to route data from
selected threads to the appropriate functional units. This
can be accomplished, for example, by sending the
register identifiers (that include a thread identifier) to the
functional units and letting the functional units read the
register contents via a separate data network, or by using
the crossbar switch 820 to move the appropriate register
contents to latches that are read by the functional units
at the beginning of the next cycle.
[0027] Out of the N packets that are fetched by the
fetch stage 760 (FIG. 7B), the priority encoder 810
selects up to N packets for execution according to priority
and resource availability. In other words, the priority
encoder selects the highest priority threads that do not
request unavailable resources for execution. It then sets
up the first crossbar switch 820. The input crossbar
switch 820 routes up to 3K*N inputs to up to 3*M outputs.
The first crossbar switch 820 has the ability to transfer
the register identifiers (or the contents of the appropriate
registers) of each packet to the appropriate functional
units.

[0028] Since there are up to N threads that can be se- 
lected in the same cycle and each thread can issue a 
packet of up to K instructions and each instruction can 
read up to 3 registers there are 3K*N register identifiers 
to select from. Since there are only M functional units 
and each functional unit can accept a single instruction, 
there are only 3M register identifiers to be selected. 
Therefore, the crossbar switch implements a 3K*N to 
3M routing of register identifiers (or register contents). 
[0029] The output crossbar switch 830 routes M
inputs to N*M or N*K outputs. The second crossbar switch
830 is set up at the appropriate time to transfer the
results of the functional units back to the appropriate
registers. The second crossbar switch 830 can be
implemented as a separate network by sending the register
identifiers (that contain a thread identifier) to the
functional units. When a functional unit computes a result,
the functional unit routes the result to the given register
identifier. There are M results that have to be routed to
up to N threads. Each thread can accept up to K results.
The second crossbar switch 830 routes M results to N*K
possible destinations. The second crossbar switch 830
can be implemented as M buses that are connected to
all N register files. In this case, the routing becomes M
results to N*M possible destinations (if the register files
have the ability to accept M results).
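
By way of a non-limiting illustration, the crossbar dimensions
stated above can be worked through numerically. The function
names and the example values N = 3, K = 4 and M = 7 are
assumptions chosen for this Python sketch; only the formulas
3K*N, 3M, N*K and N*M come from the description.

```python
from typing import Tuple


def input_crossbar_shape(n_threads: int, k_per_packet: int, m_units: int,
                         reads_per_instr: int = 3) -> Tuple[int, int]:
    """(inputs, outputs) of the first crossbar: 3K*N identifiers in, 3M out."""
    inputs = reads_per_instr * k_per_packet * n_threads
    outputs = reads_per_instr * m_units
    return inputs, outputs


def output_crossbar_shape(n_threads: int, k_per_packet: int, m_units: int,
                          shared_buses: bool = False) -> Tuple[int, int]:
    """(inputs, outputs) of the second crossbar.

    M results go to N*K destinations, or to N*M destinations when the
    switch is built as M buses connected to all N register files.
    """
    per_thread = m_units if shared_buses else k_per_packet
    return m_units, n_threads * per_thread


# Hypothetical configuration: N = 3 threads, K = 4, M = 7 (M >= K).
first = input_crossbar_shape(3, 4, 7)
second = output_crossbar_shape(3, 4, 7)
second_buses = output_crossbar_shape(3, 4, 7, shared_buses=True)
```

For this configuration the first crossbar routes 36 register
identifiers to 21 outputs, and the second routes 7 results to
12 destinations, or to 21 destinations in the bus-based variant.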
[0030] In a conventional single-threaded VLIW archi- 
tecture, all operations in an instruction packet are issued 
simultaneously. There are always enough functional 
units available to issue a packet. When an operation
takes multiple cycles, the instruction issue logic may 
stall, because there is no other source of operations 
available. In a multithreaded VLIW processor in accord- 
ance with the present invention, on the other hand, 
these restrictions do not apply. 
[0031] FIG. 9 illustrates the execution of three threads 
TA-TC for a conventional multithreaded implementation 
(without the benefit of the present invention), where 
threads B and C have higher priority than thread A. 
Since thread A runs at the lowest priority, its operations 
will be the last to be assigned. As shown in FIG. 9, five 
functional units 920 are assigned to implement the five 
operations in the current cycle of the higher priority 
threads TB and TC. Thread A has four operations, but 
there are only two functional units 920 available. Thus, 
thread A stalls for a conventional multithreaded imple- 
mentation. 

[0032] In order to maximize throughput across all 
threads, and minimize the number of cycles that the 
functional units rest idle, the present invention utilizes 
instruction packet splitting. Instead of allocating all op- 
erations in an instruction packet at one time, the alloca- 
tion hardware 780, discussed above in conjunction with 
FIG. 8, assigns as many operations from each packet 
as will fit on the available functional units. Those oper- 
ations that will not fit are retained in a ready-to-run reg- 
ister 850 (FIG. 8). On subsequent cycles, instruction 
packets in which all operations have been issued to 
functional units are updated from their thread's instruc- 
tion stream, while instruction packets with operations 
that have been held are retained. The functional unit al- 
location logic 780 can then assign operations from the 
newly-loaded instruction packets as well as operations 
that were not issued from the retained instruction pack- 
ets. 

[0033] The operation of instruction packet splitting in 
accordance with the present invention is illustrated in 
FIGS. 10A and 10B. In FIG. 10A, there are three 
threads, each with an instruction packet from location n 
ready to run at the start of cycle x. Thread A runs at the 
lowest priority, so its operations will be the last to be
assigned. Threads B and C require five of the seven
available functional units 1020 to execute. Only two
functional units 1020-2 and 1020-6 remain, so only the first two
operations from thread A are assigned to execute. All
seven functional units 1020 are now fully allocated.
[0034] At the completion of cycle x, the instruction
packets for threads B and C are retired. The instruction
issue logic associated with the threads replaces the
instruction packets with those for address n+1, as shown
in FIG. 10B. Since the instruction packet for thread A is
not completed, the packet from address n is retained,
with the first two operations marked completed. On the
next cycle, x+1, illustrated in FIG. 10B, the final two
operations from thread A are allocated to functional units,
as well as all the operations from threads B and C. Thus,
the present invention provides greater utilization of
hardware resources (i.e., the functional units 1020) and
a lower elapsed time across a workload comprising the
multiple threads.
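
By way of a non-limiting illustration, the two cycles of
FIGS. 10A and 10B can be simulated with the following Python
sketch. It assumes interchangeable functional units, non-empty
packets described only by their operation counts, and a fixed
priority order that names every thread. The individual packet
sizes for threads B and C are hypothetical, since the text
states only that together they occupy five units in cycle x.
The `ready` dictionary plays the role of the ready-to-run
register 850.

```python
from typing import Dict, List, Tuple


def allocate_with_splitting(pending: Dict[str, int], free_units: int,
                            priority_order: List[str]
                            ) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Assign as many pending operations per thread as fit, in priority order.

    Returns (operations issued this cycle, operations still held back).
    """
    issued, remaining = {}, {}
    for name in priority_order:
        count = min(pending[name], free_units)
        issued[name] = count
        remaining[name] = pending[name] - count
        free_units -= count
    return issued, remaining


def simulate(streams: Dict[str, List[int]], units: int,
             priority_order: List[str]) -> List[Dict[str, int]]:
    """Run until every packet of every thread has issued; return per-cycle issue counts."""
    if units < 1:
        raise ValueError("at least one functional unit is required")
    next_index = {name: 0 for name in streams}
    ready = {name: 0 for name in streams}  # unissued operations of the current packet
    trace = []
    while True:
        # Threads whose packet fully issued load the next packet from their stream.
        for name, packets in streams.items():
            if ready[name] == 0 and next_index[name] < len(packets):
                ready[name] = packets[next_index[name]]
                next_index[name] += 1
        if all(count == 0 for count in ready.values()):
            break
        issued, ready = allocate_with_splitting(ready, units, priority_order)
        trace.append(issued)
    return trace


# Thread A has one four-operation packet; B and C have packets at n and n+1.
streams = {"A": [4], "B": [3, 2], "C": [2, 3]}
trace = simulate(streams, units=7, priority_order=["B", "C", "A"])
```

In this sketch the first cycle issues five operations for
threads B and C and the first two operations of thread A, and
the second cycle issues the remaining two operations of thread
A together with the next packets of threads B and C, so all
seven units are busy in both cycles.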

[0035] An instruction packet cannot be split without
first verifying that splitting would not violate the semantics
of the instruction packet assembled by the compiler. In
particular, the input value of a register is assumed to be the
same for instructions in a packet, even if the register is
modified by one of the instructions in the packet. If the
packet is split, and a source register for one of the
instructions in the second part of the packet is modified
by one of the instructions in the first part of the packet,
then the compiler semantics will be violated. This is
illustrated in the program fragment 1110 in FIG. 11.
[0036] As shown in FIG. 11, if instructions in locations
L1, L2 and L3 are assembled into an instruction packet,
and R0 = 0, R1 = 2, and R2 = 3 before the packet
executes, then the value of R0 will be 5 after the packet
completes. On the other hand, if the packet is split and
instruction L1 executes before L3, then the value of R0
will be 2 after the packet completes, violating the
assumptions of the compiler.
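
By way of a non-limiting illustration, the effect described
above can be reproduced with a small Python model of packet
semantics. The three instructions used here are hypothetical:
they are chosen only so that the register values match those
stated in the text (R0 = 0, R1 = 2, R2 = 3, giving R0 = 5 when
unsplit and R0 = 2 when split), and they are not asserted to
be the instructions of FIG. 11. Every instruction is modelled
as an addition of two source registers, and register R3 is an
added scratch register.

```python
from typing import Dict, List, Tuple

# Hypothetical instruction format: (destination, source_a, source_b), meaning
# destination = source_a + source_b.
Instr = Tuple[str, str, str]


def execute_packet(regs: Dict[str, int], packet: List[Instr]) -> Dict[str, int]:
    """Packet semantics: every source is read before any destination is written."""
    results = [(dest, regs[a] + regs[b]) for dest, a, b in packet]
    updated = dict(regs)
    for dest, value in results:
        updated[dest] = value
    return updated


def execute_split(regs: Dict[str, int], first: List[Instr],
                  second: List[Instr]) -> Dict[str, int]:
    """Issue the first part in one cycle and the remainder in a later cycle."""
    return execute_packet(execute_packet(regs, first), second)


initial = {"R0": 0, "R1": 2, "R2": 3, "R3": 0}
L1 = ("R2", "R0", "R0")   # hypothetical: overwrites R2 with 0
L2 = ("R3", "R1", "R1")   # hypothetical: unrelated to the hazard
L3 = ("R0", "R1", "R2")   # hypothetical: reads R2, which L1 writes

whole = execute_packet(initial, [L1, L2, L3])
split = execute_split(initial, [L1, L2], [L3])
```

Executed as a single packet, L3 reads the original value of R2
and R0 becomes 5. When L1 issues in an earlier cycle than L3,
L3 reads the value of R2 written by L1 and R0 becomes 2.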

[0037] One means to avoid packet splitting that would
violate program semantics is to add hardware to the
instruction issue logic that would identify when destination
registers are used as sources in other instructions in an
instruction packet. This hardware would inhibit packet
splitting when one of these read-after-write hazards
exists. The mechanism has the disadvantage of requiring
additional hardware, which takes area resources and
could possibly impact critical paths of a processor and
therefore reduce the speed of the processor.

[0038] A compiler can easily detect read-after-write
hazards in an instruction packet. It can therefore choose
to never assemble instructions with these hazards into
an instruction packet. The hardware would then be
forced to run these instructions serially, and thereby
avoid the hazard. Any instruction packet that had a
read-after-write hazard would be considered in error, and the
architecture would not guarantee the results. Although
safe from semantic violations, this technique has the
disadvantage that it does not exploit potential
parallelism in a program, since the parallelism available in the
instruction packet with a hazard is lost, even if the
underlying hardware would not have split the packet.
[0039] The present invention combines compiler 
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knowledge with a small amount of hardware. In the Il- 
lustrative implementation, a single bit, referred to as the 
split bit, is placed in the prefix of a multi-instruction pack- 
et to inform the hardware that this packet cannot be split. 
Since the compiler knows which packets have potential 
read-after-write hazards, it can set this bit in the packet 
prefix whenever a hazard occurs. At run-time, the hard- 
ware will not split a packet with the bit set, but instead 
will wait until all Instructions in the packet can be run in 
parallel. This concept is illustrated in FIGS. 12 through 
15. 

[0040] The compiler detects that the three instruction
sequence 1210 in FIG. 12 can be split safely, so the split
bit is set to 1, as shown in FIG. 13. In FIG. 14, on the
other hand, the three instruction sequence 1410 cannot
be split, since there is a read-after-write hazard between
instructions L1 and L3. The split bit is therefore set to 0,
as shown in FIG. 15.
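
The allocation behaviour described above can be sketched for a single cycle as follows. This is a simplified illustration rather than the disclosed hardware: functional units are treated as interchangeable, packets are served in list order, and the names Packet, may_split and allocate are assumptions introduced here. The flag follows the FIG. 13 and FIG. 15 convention, where 1 (True) means the packet may be split.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    thread: str      # owning thread (hypothetical label)
    ops: list        # operations not yet issued
    may_split: bool  # True = splitting permitted (FIG. 13 convention)


def allocate(packets, free_units):
    """Assign operations to functional units for one cycle.

    Returns (issued, retained): the (thread, op) pairs issued this
    cycle and the packets, whole or partial, held for a later cycle.
    """
    issued, retained = [], []
    for p in packets:
        if len(p.ops) <= free_units:
            # The whole packet fits on the remaining units.
            issued += [(p.thread, op) for op in p.ops]
            free_units -= len(p.ops)
        elif p.may_split and free_units > 0:
            # Issue as many operations as fit; hold the remainder.
            issued += [(p.thread, op) for op in p.ops[:free_units]]
            retained.append(Packet(p.thread, p.ops[free_units:], p.may_split))
            free_units = 0
        else:
            # Splitting prohibited (or no unit free): hold the packet whole.
            retained.append(p)
    return issued, retained


packets = [
    Packet("A", ["a1", "a2", "a3"], True),
    Packet("B", ["b1", "b2"], False),
    Packet("C", ["c1", "c2"], True),
]
issued, retained = allocate(packets, free_units=4)
```

With four free units, thread A issues all three operations, thread B is held whole because it may not be split, and thread C issues one operation and retains the other.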

[0041] It is to be understood that the embodiments 
and variations shown and described herein are merely 
illustrative of the principles of this invention and that var- 
ious modifications may be implemented by those skilled 
in the art without departing from the scope and spirit of 
the invention. 



Claims 

1. A multithreaded very large instruction word (VLIW)
processor, comprising: 

a plurality of functional units for executing a plu- 
rality of instructions from a multithreaded in- 
struction stream, said instructions being 
grouped into packets by a compiler, said com- 
piler including an indication in said packet of 
whether said instructions in said packet may be 
split; and 

an allocator that selects instructions from said 
instruction stream and forwards said instruc- 
tions to said plurality of functional units, said al- 
locator assigning instructions from at least one 
of said instruction packets to a plurality of said 
functional units if said indication indicates said 
packet may be split. 

2. The multithreaded very large instruction word 
(VLIW) processor of claim 1, wherein said indication
is a split bit. 

3. The multithreaded very large instruction word 
(VLIW) processor of claim 1, wherein said allocator
assigns as many instructions from a given instruc- 
tion packet as permitted by an availability of said 
functional units. 

4. The multithreaded very large instruction word 
(VLIW) processor of claim 1, further comprising a
register for storing for execution in a later cycle an
indication of those instructions from a given instruction
packet that cannot be allocated to a functional
unit in a given cycle.

5. The multithreaded very large instruction word
(VLIW) processor of claim 4, wherein instruction
packets in which all instructions have been issued
to functional units are updated from the instruction
stream of said thread.

6. The multithreaded very large instruction word 
(VLIW) processor of claim 4, wherein instruction 
packets with instructions indicated in said register
are retained.

7. A method of processing instructions from a
multithreaded instruction stream in a multithreaded very
large instruction word (VLIW) processor, comprising
the steps of:

executing said instructions using a plurality of
functional units, said instructions being
grouped into packets by a compiler, said
compiler including an indication in said packet of
whether said instructions in said packet may be
split; and

assigning instructions from at least one of said
instruction packets to a plurality of said
functional units if said indication indicates said
packet may be split; and

forwarding said selected instructions to said
plurality of functional units.

8. The method of claim 7, wherein said indication is a
split bit. 

9. The method of claim 7, wherein said assigning step
assigns as many instructions from a given instruction
packet as permitted by an availability of said
functional units.

10. The method of claim 7, further comprising the step
of storing for execution in a later cycle an indication
of those instructions from a given instruction packet
that cannot be allocated to a functional unit in a
given cycle.

11. The method of claim 10, wherein instruction packets
in which all instructions have been issued to
functional units are updated from the instruction
stream of said thread.

12. The method of claim 10, wherein instruction packets
with instructions indicated in said register are
retained.

13. An article of manufacture for processing instructions
from an instruction stream having a plurality
of threads in a multithreaded very large instruction
word (VLIW) processor, comprising:

a computer readable medium having computer
readable program code means embodied thereon,
said computer readable program code means
comprising program code means for causing a
computer to:

execute said instructions using a plurality of
functional units, said instructions being
grouped into packets by a compiler, said
compiler including an indication in said packet of
whether said instructions in said packet may be
split; and

assign instructions from at least one of said
instruction packets to a plurality of said functional
units if said indication indicates said packet
may be split; and

forward said selected instructions to said
plurality of functional units.

14. A compiler for a multithreaded very large instruction
word (VLIW) processor, comprising:

a memory for storing computer-readable code;
and

a processor operatively coupled to said
memory, said processor configured to:

translate instructions from a program into
a machine language;

group a plurality of said instructions into a
packet; and

provide an indication with said packet
indicating whether said instructions in said
packet may be split.

15. The compiler of claim 14, wherein said instruction
packet can be split provided the semantics of the
instruction packet assembled by the compiler are
not violated.

16. The compiler of claim 14, wherein said instruction
packet can be split provided a source register for
one of the instructions in a first part of said packet
is not modified by one of the instructions in a second
part of said packet.

17. An article of manufacture for compiling instructions
from an instruction stream having a plurality of
threads for use in a multithreaded very large instruction
word (VLIW) processor, comprising:

a computer readable medium having computer
readable program code means embodied thereon,
said computer readable program code means
comprising program code means for causing a
computer to:

translate instructions from a program into a
machine language;

group a plurality of said instructions into a
packet; and

provide an indication with said packet indicating
whether said instructions in said packet may
be split.

Abstract

One of the main problems that prevents extensive use of
VLIW architectures for non-numeric programs is the lack of
object code (or binary) compatibility among different
implementations of the same architecture. This is due to exposing
all architectural features to generate code at compile time.
New features of a VLIW machine may lead to incorrect
results when executing code compiled for the older machine.
In this paper, a new approach to overcome this problem is
presented, which we call dynamic VLIW generation (DVG).
It is performed with the help of code annotation provided
by the compiler, to reduce the complexity of the required
hardware. In the DVG technique, operations are rescheduled
for the new machine at the time of instruction cache
miss repair. In this way, the rescheduler hardware is not
located in the execution pipeline engine, avoiding potentially
longer cycle times. To simplify the dependency checking
hardware, dependency information is encoded for each
operation at compile time. This information can be combined
into the final binary code, or may be provided as a separate
file, which can be loaded into memory at execution time by
the OS loader. In this technique operations can be rescheduled
speculatively, and a mechanism is presented to prevent
destroying the contents of live registers. Experimental
results show that the performance of rescheduled code using
the DVG technique is about 10% worse than code compiled
directly for the target processor.



1. Introduction 

A Very Long Instruction Word (VLIW) processor [9]
exploits ILP that has been extracted through compiler
transformations. The advantage of these processors is their ability
to exploit large amounts of ILP with relatively simple and
fast control hardware. This is in comparison with the
popular superscalar processors [12], in which ILP extraction is
performed dynamically at run-time.

In VLIW processors, architecturally visible parallelism
capability is exposed to the compiler. This includes an
accurate semantic model of the processor, which defines the
operation latencies, and a description of the allowed sets
of independent operations in a VLIW instruction. The two
main features of VLIW architectures are multiple-operation
instruction (MultiOp) [18] and non-unit assumed operation
latency (NUAL) [17].

In conventional RISC and superscalar processors, which
are usually based on unit-assumed latency (UAL) execution
semantics, it is assumed that previous operations in the
sequence are committed at the time of source operand access
for the current operation. If this is not the case (such as
in an out-of-order superscalar processor), or when the actual
latency is greater than specified, additional hardware
mechanisms such as register interlocks are provided to preserve
program correctness. Figure 1 shows a piece of code to
illustrate differences in interpretation of program semantics
between NUAL and UAL. In Figure 1 (a), operations 3 and
4 use the results of operations 1 and 2 respectively. In
Figure 1 (b), due to UAL semantics, both operations 3 and 4
use the result of operation 2 and the destination register of
operation 1 is overwritten by operation 2. In this manner,
the semantics of the same code are different in Figures 1 (a)
and (b).

NUAL provides more efficiency in code optimization
and generation for the compiler. As the details of the target
processor are exposed to the compiler, the resource usage
of the target processor can be more efficiently scheduled by
the compiler. However, this is at the expense of
introducing some problems when the control flow at run-time is not
the same as assumed at compile time. Unexpected events
like exceptions dynamically change the assumed program
behavior due to control transfer to the exception handling
routine, which causes the assumed latency of the scheduled
operations to be non-deterministic. Hence, exact NUAL
semantics are violated.

Figure 2: Scheduled code example for an issue-4
hypothetical VLIW machine (generation-1).

To handle this problem, two scheduling models of
NUAL referred to as the EQ (equals) model and the LEQ (less
than or equals) model have been defined [17]. An EQ NUAL
operation accesses its operands at the specified time and
writes back the result exactly at its latency time. In the LEQ
model, if the operation latency is L, it is assumed the result
is available between one and L cycles after launching the
operation. Therefore, unexpected run-time events are
handled more easily with LEQ NUAL semantics, while with
EQ NUAL, hardware support is required to save the
status of the processor before execution of exception handling
code.



1.1 Object-Code Incompatibility

When the assumed architectural features such as operation
latencies, the number of functional units and register
file specifications are changed, object code compatibility is
not preserved among different implementations of the same
architecture. Figure 2 shows an example of a sequence of
operations scheduled for a hypothetical issue-4 VLIW
machine. In Figure 3, execution of the same code, when the
latencies for the MULT and MEM functional units have
increased, leads to incorrect results. A decrease in latencies
does not violate the program semantics for the LEQ model,
but this is not the case for the EQ model. In the case of
changes in the number of functional units, a narrower-issue
processor may execute a section of the VLIW instruction
from the original VLIW code, but correctness and respect
for dependencies are not guaranteed. In a wider-issue
processor, the original code may be executed correctly, but
the additional new resources are not utilized.

Figure 3: Incorrect interpretation of the code scheduled for
VLIW (generation-1) in a new VLIW (generation-2) with
different operation latencies.

In this paper, a new approach to overcome this problem
is presented. It is performed with the help of code
annotation provided by the compiler to reduce the complexity
of the required hardware. The rest of this paper is
organized as follows. In section 2, related work is reviewed.
Our new approach is described in section 3. Differences
between our approach and previous research are discussed in
section 4. Section 5 presents the preliminary experimental
results. Section 6 concludes the paper and describes
possible future work.



2 Related Works 

Several software and hardware approaches to overcome 
the binary incompatibility problem have been proposed by 
researchers. An application program which was compiled 
for one architecture can be converted to a new code for an- 
other architecture through binary translation. This is an off- 
line approach which does not affect the run-time behavior of 
the processor. This technique has been used for large scale 
migration of programs from one generation of machine to 
the next [19,20]. 

Run-time rescheduling can be performed by a software 
method, or a hardware technique, or a combination of both. 
Figure 4 shows a run-time rescheduling technique through 
the operating system of the underlying machine. Two major 
projects have been reported which use this approach.

Figure 4: OS-based dynamic rescheduling.



Conte and Sathaye described a dynamic rescheduling
approach to reschedule code on a page fault [6]. At the
first-time page fault, the OS detects the difference between the
current machine's characteristics and those assumed to gen- 
erate the executable code. This page of code is rescheduled 
by special scheduling software that is invoked by the OS to 
generate a correct executable code for the current machine. 
When the same pages are accessed again, they are retrieved 
from the text swap space of the OS, where they were saved 
when displaced by new pages. A technique called the per- 
sistent rescheduled-page cache (PRC) was proposed to re- 
duce the overhead when the same pages are accessed several 
times [7]. 

Another strategy based on the approach shown in Figure
4 was presented by Ebcioglu and Altman [8]. The aim of
this work is to emulate an old architecture on a new
architecture. It follows similar steps to those mentioned above, to
generate a new schedule for a tree-based VLIW processor.

Figure 5 illustrates the general outline of hardware-based 
techniques to achieve object code compatibility among dif- 
ferent generations of a VLIW architecture. In Figure 5 
(a), dynamic rescheduling is performed in the execution 
pipeline path. Rau [17] presented dynamic scheduling for
non-unit assumed operation latencies (NUAL) execution se- 
mantics in VLIW machines. 

In Figure 5 (b), dynamic rescheduling is performed out
of the execution pipeline path. One proposed approach in
this category is based on a mechanism called the fill unit.
The fill unit was originally proposed to form larger
executable units from microoperations [15]. It was extended by
Franklin and Smotherman [10] to form VLIW instructions
dynamically from operations that can be issued together.
This works as follows. The fill unit receives operations in
program order from the conventional instruction cache
and decoder. A group of decoded operations which can be
issued at the same time is formed in the fill unit buffer.
When the buffer is full, it is written into an extra cache
called the shadow cache. At each cycle, both the
conventional instruction cache and the shadow cache are accessed,
and if the shadow cache hits, the VLIW-type instruction
from the shadow cache is sent to the functional units.
Otherwise, one operation fetched from the conventional I-Cache
is executed.

Figure 5: Outline of hardware-based techniques to achieve
object code compatibility for different generations of the
VLIW architecture. (a) Dynamic rescheduling at the
execution pipeline path. (b) Dynamic rescheduling before
moving operations to the instruction cache.

The DIF (Dynamic Instruction Formatting) machine [16]
is another approach which can be considered to be based on 
Figure 5 (b). In the DIF machine, operations are executed 
for the first time by a separate processing engine referred 
to as the primary engine in [16]. At the same time, the de- 
pendency information is provided by the primary engine to 
translator hardware, which reschedules these executed op- 
erations as a DIF group. A DIF group is the unit of ex- 
ecution and consists of a sequence of VLIW instructions. 
DIF groups are stored in the DIF cache which is connected 
to another execution engine called the parallel engine and 
is similar to a VLIW execution pipeline. This approach is 
able to schedule operations speculatively above a few con- 
ditional branches. 

Miss Path Scheduling (MPS), introduced by Banerjia
et al. [2], is also based on the approach of Figure 5 (b).
In this technique, operations are rescheduled at cache miss
time when the operations are received from a higher level
of memory. The rescheduling hardware is able to schedule
operations speculatively above conditional branches. Then,
scheduled blocks are provided to an in-order execution
engine. A scheduled block contains a set of VLIW-type
instructions. The MPS technique uses one cache to hold
operations fetched by the processor pipeline, in comparison with
the two different caches (shadow cache and instruction cache)
required for the fill unit approach and the DIF method.

3 A New Approach 

This section describes our proposed approach to
overcoming the binary incompatibility problem between different
generations of VLIW machines. We call this dynamic
VLIW generation (DVG). It requires ISA support and
additional hardware, but this hardware is not located on the
critical path of the execution pipeline. Operations are
rescheduled at cache miss repair. Thus, DVG is based on the
approach in Figure 5 (b).

To simplify dependency checking hardware, some form
of dependency information is encoded for each operation at
compile time. Each operation has a dependency word. This
information can be combined into the final binary code or
may be provided in a separate file, which can be loaded into
memory by the OS loader at execution time.

To schedule operations, the compiler generates a
dependency graph to capture all dependency information in a
scheduling region (such as a hyperblock). Transferring all
of this information through the object file is not practical as
it requires a large amount of disk space. Also, this
information should be available in a suitable form to be used by
a dynamic rescheduler with low overhead. Therefore, for
DVG, the compiler provides a limited form of dependency
information which will be processed at the time of
instruction cache miss repair.

The simplest way of encoding dependency information
is by using a bit vector for each group of operations. Each
bit can represent a dependency on a previous operation
based on its position in the bit vector. For example, if an
operation has a dependency on the 14th operation before it,
then bit 13 in the bit vector is set. This scheme, however,
increases the code size. If each operation is 32 or 64 bits,
then having a 64-bit dependency word is unacceptable. To
reduce the size of the dependency word, each bit may
represent a dependency on a packet of previous operations. For
example, if each bit represents a dependency on a packet of
eight operations, keeping dependency information for 64
previous operations requires only 8 bits. This is at the
expense of imposing more limitations at the time of
rescheduling. As an example, if operation x is only dependent on one
of the operations in the corresponding packet, it cannot be
scheduled before all eight operations in that packet unless it
is known which operation in that packet should precede it.
This can be done by using special hardware to resolve the
issue. Details are described in the following sections.
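
The two encodings can be illustrated with a short sketch. The function names and the packet_size parameter are assumptions introduced here for illustration; only the bit positions follow the text (a dependency on the 14th previous operation sets bit 13, and one bit per packet of eight operations covers 64 previous operations in 8 bits).

```python
def encode_deps(distances, packet_size=1):
    """Build a dependency word from distances to earlier operations.

    A distance d (d >= 1) means "depends on the d-th operation before
    this one". With packet_size=1, bit d-1 is set; with a larger
    packet_size, one bit stands for a whole packet of operations.
    """
    word = 0
    for d in distances:
        word |= 1 << ((d - 1) // packet_size)
    return word


def decode_deps(word, packet_size=1):
    """Return every distance the rescheduler must conservatively respect."""
    distances = []
    bit = 0
    while word >> bit:
        if (word >> bit) & 1:
            first = bit * packet_size + 1
            distances.extend(range(first, first + packet_size))
        bit += 1
    return distances


exact = encode_deps([14])                  # one bit per operation
coarse = encode_deps([14], packet_size=8)  # one bit per packet of eight
print(exact == 1 << 13, coarse, decode_deps(coarse, packet_size=8))
```

The coarse word is much smaller, but decoding it returns the whole packet of distances 9 to 16, which is the loss of precision the paragraph above describes.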

3.1 Speculative Scheduling 

It has been shown that speculative scheduling is an
essential technique to achieve higher amounts of ILP in
general-purpose applications [13]. In order to perform
speculative code motion in DVG, several issues should be
considered.

The first issue is a method to predict conditional
branches. In our scheme, two possibilities are available.
One is an ISA extension, so that each conditional branch
operation has an additional bit to indicate the prediction of
its direction [11]. This bit is set if the branch is predicted
taken. The prediction can be based on profile information
or the heuristics used by the compiler. The other possible
method is modifying the code layout, so that the most likely
direction of a forward conditional branch is its fall-through
path.

Figure 6: An example of using the rotational remapping
scheme in DVG with special status fields to restore
registers' original state when a branch is mispredicted. (a) A
sequence of operations to be rescheduled. (b) State of r1
before rescheduling of I*. (c) State of r1 after moving I*
above BR2. (d) State of r1 after moving I* above BR1.

The second issue is providing a mechanism to keep and 
restore the processor state when the prediction direction is 
not valid. Speculative motion of operations which write into 
a register is not allowed when the destination register is live 
on the other path. For this purpose, it is necessary to rename 
the architectural registers. 

Register renaming involves mapping an architectural
register into a physical register where the result of the
operation is written. Following operations refer to this physical
register to retrieve the value for their input operands. In a
typical hardware register renaming scheme in superscalar
processors, a mapping table is looked up to determine the
current mapping of the register when the contents of the
register are read. The mapping table is updated when the
register is written. A large number of ports is required to
perform this process when multiple operations are executed
concurrently.

To avoid this complexity, we extend a form of the
rotational remap renaming scheme which was presented in [16].
In our scheme, the register file has several physical
locations and status fields for each architectural register. The
status fields indicate which physical location holds the most
recent value and how the original physical location can be
identified when a branch is mispredicted. Figure 6 shows
an example to illustrate how this scheme works.

A sequence of operations with two conditional branches
is shown in Figure 6 (a). The aim is to move operation I*
above branches BR1 and BR2. Figure 6 (b) shows the state
of register r1 if I* is not scheduled above BR2. After
moving it up above BR2, the state is changed to that shown in
Figure 6 (c). One status field indicates the current
physical location mapped to r1. Speculation adjustment (SA)
determines how many locations should be moved to the left
(considering a rotating remapping scheme) to reach the
correct value of the architectural register when the branch is
mispredicted. In the case of moving I* above BR2, the
value in the speculation adjustment field is set to 01, which
means one shift to the left to access the original value of
r1 if BR2 is taken.

Figure 6 (d) shows the case when I* is scheduled above
BR1. In this case, I* is speculated over two conditional
branches, and if one of them is taken the original state is
preserved by moving two locations to the left (based
on a speculation adjustment of 10). It should be noted that
no operation exists in the fall-through path of BR1 to change
r1. Otherwise, it would not be possible to schedule I* above
BR1 due to that dependency constraint.
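
The recovery step can be sketched in a few lines. The text does not give the exact layout of the status fields, so the model below is an assumption made purely for illustration: a speculative write advances the current-location pointer by the number of branches crossed and records that number as the speculation adjustment, and a misprediction rotates the pointer back by the same amount. The class and method names are hypothetical.

```python
class RotatingRegister:
    """One architectural register backed by several physical locations."""

    def __init__(self, value, locations=4):
        self.phys = [value] + [None] * (locations - 1)
        self.current = 0      # location holding the most recent value
        self.spec_adjust = 0  # SA: locations to rotate left on mispredict

    def read(self):
        return self.phys[self.current]

    def speculative_write(self, value, branches_crossed):
        # Assumed behaviour: rotate right so the original value survives.
        self.current = (self.current + branches_crossed) % len(self.phys)
        self.phys[self.current] = value
        self.spec_adjust = branches_crossed

    def mispredict(self):
        # Rotate left by SA to expose the original value again.
        self.current = (self.current - self.spec_adjust) % len(self.phys)
        self.spec_adjust = 0


r1 = RotatingRegister(7)
r1.speculative_write(99, branches_crossed=2)  # moved above BR1 and BR2
speculative_value = r1.read()
r1.mispredict()                               # one of the branches is taken
restored_value = r1.read()
```

Because the original value is never overwritten, recovery needs only a pointer update rather than a lookup in a multi-ported mapping table.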

3.2 DVG Scheduling Algorithm 

We propose the DVG approach in the context of
predicated code based on hyperblock scheduling [14]. The
beginning and end points of each hyperblock are encoded in
the ISA by the compiler. Any code motion is performed in
the hyperblock. This means that the original hyperblock is
rescheduled for the new machine.

Figure 7 illustrates the DVG algorithm. At instruction 
cache miss time, the value of the PC is used to load at most 
N operations from the upper level of the memory hierarchy. 
N is implementation dependent. The issue-width of the pro- 
cessor and the implementation details of the DVG hardware 
are used to determine N, so that the amount of rescheduling 
overhead is tolerable.

Figure 7: Main steps of the DVG scheduling algorithm.



After receiving each operation from the upper level of 
memory, it is scheduled and placed in the scheduling buffer. 
The scheduling buffer is similar to the reservation table 
in [2]. Its structure is like a matrix with m rows and n
columns, where m is the maximum number of cycles (or 
the maximum number of generated VLIWs), and n is the 
issue-width of the processor. Also, a structure called the 
operation scoreboard is used to record the status of each 
processed operation. The cycle in which the result of each 
scheduled operation will be ready is recorded in the opera- 
tion scoreboard. 

Each architectural register has a counter which is loaded
with the latency value of the operation writing into it. Once
execution of that operation has started, the counter is
decremented each cycle until it reaches 0. Then, the ready bit is
set. When a register is accessed for reading, if the result
is not ready, all operations in the current VLIW are stalled
until the required result is ready. The main purpose of the
counter is to handle cases when the latency of an operation
scheduled in the previous scheduling region is not fulfilled.
At the same time, this mechanism preserves the processor
state at the time of unexpected events which may increase
the latency of an operation. The value of these counters may
be saved and restored when necessary to keep the processor
state.
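
A minimal model of the per-register counter and ready bit is sketched below. The class and function names are hypothetical, and the sketch tracks a single register rather than a whole VLIW; it is meant only to show how a reader that arrives early stalls until the counter reaches 0.

```python
class RegisterCounter:
    """Latency counter and ready bit for one architectural register."""

    def __init__(self):
        self.count = 0

    @property
    def ready(self):
        # The ready bit is set once the counter has reached 0.
        return self.count == 0

    def start_write(self, latency):
        # Loaded with the latency of the operation writing the register.
        self.count = latency

    def tick(self):
        # Decremented once per cycle while a write is outstanding.
        if self.count > 0:
            self.count -= 1


def stall_cycles(counter):
    """Cycles a reading VLIW must stall before the result is ready."""
    stalls = 0
    while not counter.ready:
        counter.tick()
        stalls += 1
    return stalls


r = RegisterCounter()
r.start_write(latency=3)  # e.g. a 3-cycle operation starts
r.tick()                  # one cycle passes before the reader arrives
stalled = stall_cycles(r)
```

Here the reader arrives one cycle after a 3-cycle operation starts, so it stalls for the two remaining cycles.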

When the maximum number of generated VLIWs is
reached, or the last operation of a hyperblock is scheduled,
or N operations are processed, the rescheduling process is
terminated and the contents of the scheduling buffer are
transferred to the instruction cache. The amount of time
required for this step depends on the structure of the
instruction cache. In [2], two cache organizations for
holding VLIWs were considered. These are the uncompressed
I-Cache and the compressed banked I-Cache [5]. As the
structure of the uncompressed I-Cache is similar to the
scheduling buffer, transferring VLIWs from the scheduling
buffer to the I-Cache does not require a long time in
comparison to the compressed banked cache. In the latter, NOP
operations are not included in the cache and special encoding
is used to handle this. This results in a longer time to transfer
operations from the scheduling buffer to the I-Cache.

To schedule operations speculatively, the rotational
remapping scheme is employed as discussed above. To
prevent scheduling of a branch before operations which
originally preceded it, the latest scheduled cycle is kept in a
special register. Branches cannot be reordered. Also, an
operation cannot be moved above a subroutine call or an
indirect jump. The ISA of our experimental VLIW
architecture includes non-trapping versions of excepting operations,
so it does not require additional hardware to keep precise
exception handling. Otherwise, similar hardware schemes
to those used in any multiple-issue processor to achieve
precise exception handling would be necessary.

Figure 8: An example of the DVG rescheduling process. (a)
Code scheduled for the old machine. (b) Code rescheduled
for the new machine.

We use the example in Figure 8 to describe how the DVG
approach works. Figure 8 (a) shows a code segment
scheduled for an issue-2 VLIW processor which is considered
as the old machine. Operations I1 to I15 are located
consecutively in memory with their dependency words. In this
example, predicate operands are assumed as "TRUE" and
are not shown. Also, we assume each bit in the
dependency word is used to indicate a dependency on one
previous operation.

Figure 8 (b) illustrates the scheduled operations for the 
new VLIW machine, which is an issue-3 processor and the 
latency for multiplication is reduced to 2 cycles in compar- 
ison to the old machine. The scheduling process is as fol- 
lows. 

When a cache miss occurs, the address of the operation
that caused the miss is used to fetch the operation, with its
dependency word, from the higher level of the memory
hierarchy. The dependency word is applied to the operation
scoreboard to find the earliest cycle for scheduling.
Operations I1, I2 and I3 have no dependency on previous
operations (as the region begins with I1). Therefore, they
can be scheduled in cycle 0 in the new machine. Then,
operation I4 is processed. It depends on I1, so referring to
the operation scoreboard determines that I4 can be scheduled
in cycle 2. The other operations are processed and scheduled
in the same way. Operations I10, I11 and I12 are speculative,
and the status fields in the related architectural registers
are set appropriately as described before. When an excepting
operation is speculated, its non-trapping version is used as
the speculative operation.
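
The scoreboard lookup described above can be sketched in Python
as follows. This is a minimal illustration under stated
assumptions, not the hardware design: the Op record, the
reschedule function, the per-operation latency field, and the bit
encoding (bit k set means a dependency on the operation k+1
positions earlier) are all hypothetical. Functional unit types,
predicates, and speculation are ignored. The sample data assume
that I1 is a 2-cycle multiply and that I4 depends only on I1,
which is consistent with the cycle numbers quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    latency: int   # assumed: cycles until the result is available
    dep_word: int  # assumed: bit k set => depends on the op k+1 positions earlier


def reschedule(ops, issue_width):
    """Schedule ops in program order; return a dict of name -> issue cycle."""
    ready_at = []   # scoreboard: cycle at which each op's result is ready
    slots_used = {} # cycle -> number of issue slots already taken
    cycle_of = {}
    for i, op in enumerate(ops):
        # Earliest cycle allowed by the producers named in the dependency word.
        earliest = 0
        word, k = op.dep_word, 0
        while word:
            producer = i - (k + 1)
            if word & 1 and producer >= 0:
                earliest = max(earliest, ready_at[producer])
            word >>= 1
            k += 1
        # First cycle at or after 'earliest' that still has a free issue slot.
        cycle = earliest
        while slots_used.get(cycle, 0) >= issue_width:
            cycle += 1
        slots_used[cycle] = slots_used.get(cycle, 0) + 1
        cycle_of[op.name] = cycle
        ready_at.append(cycle + op.latency)
    return cycle_of


# Assumed sample: I1 is a 2-cycle multiply; I4 depends on I1 (3 positions back).
sample = [Op("I1", 2, 0b000), Op("I2", 1, 0b000),
          Op("I3", 1, 0b000), Op("I4", 1, 0b100)]
print(reschedule(sample, issue_width=3))
```

On an issue-3 machine this places I1, I2 and I3 in cycle 0 and I4
in cycle 2, matching the walk-through above.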

In the case of backward branches, which are marked by the
compiler, a limited form of loop unrolling can be applied. Its
extent depends on N (the maximum number of operations to be
rescheduled) and on the distance between the branch and its
target.

4 Comparison with Related Works 

The DVG approach, like the fill unit [10], the DIF machine
[16], and MPS [2], reschedules operations before placing them
in the cache that is accessed by the parallel execution engine
of the machine. The fill unit method lacks speculative
scheduling and requires two instruction caches (the
conventional I-Cache and the shadow cache). The DIF machine
also employs two caches and has two different execution
engines. The DVG method, on the other hand, requires only one
instruction cache. It is an extension of the conventional VLIW
processor that reschedules operations dynamically at cache
miss time. The published work most similar to the DVG method
is the MPS method, in which operations are also rescheduled
during cache miss repair. The major difference between the DVG
approach and the previous works (particularly the MPS method)
is the use of code annotations generated by the compiler to
direct the rescheduling process. Also, a limited form of loop
unrolling can be performed for backward branches in the DVG.

5 Experimental Results 

To evaluate the effectiveness of the DVG approach, ex- 
perimental results were generated based on the following 
methodology. 

We use our VLIW compiler [3], which generates predicated
code through hyperblock scheduling [14], to compile the
benchmark programs. We modified mlcache [21], a multi-lateral
cache simulator, so that at each cache miss at most N
operations are fetched from the next level of the memory
hierarchy and scheduled. The scheduled operations are placed
into the appropriate locations of the instruction cache. A
perfect data cache is assumed for all experiments. The
instruction cache is a 64K direct-mapped banked cache [5]. The
assumed bandwidth between the I-Cache and the next level of
memory is one word (32 bits) per cycle with a latency of six
cycles. These values are assumed to be typical of current
memory technology. To approximate the scheduling overhead,
five additional cycles are added to the total miss repair
latency. This assumption is reasonable, especially if
pipelined hardware is employed.
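
As a rough illustration of these parameters, the stall time for
one miss can be computed as below. The function name and the
words_per_op parameter are hypothetical, since the text does not
state how many 32-bit words an operation and its dependency word
occupy. The model also assumes that the six-cycle access latency
is followed by a streamed transfer at one word per cycle, which is
one plausible reading of the stated bandwidth and latency.

```python
def miss_repair_cycles(n_ops, words_per_op=1, access_latency=6,
                       words_per_cycle=1, scheduling_overhead=5):
    """Estimated stall cycles for one instruction cache miss.

    Defaults follow the values quoted in the text; words_per_op is an
    assumption made only for this example.
    """
    words = n_ops * words_per_op
    transfer = -(-words // words_per_cycle)  # ceiling division
    return access_latency + transfer + scheduling_overhead


# Fetching and rescheduling 8 one-word operations: 6 + 8 + 5 cycles.
print(miss_repair_cycles(8))
```

With these assumptions the fixed cost per miss is 11 cycles, and
each additional word fetched adds one cycle.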

To record the number of cycles, one cycle latency is as- 
sumed for each cache hit, and the processor stalls for the 
period of the cache miss repair. The number of execution 
cycles for the old machine and the speedup are obtained as 
follows. 

The number of execution cycles for each benchmark program
is estimated from the static code schedules weighted by
dynamic execution frequencies obtained from profiling. It was
reported that the accuracy of such estimates is better than
95% in comparison with simulation results assuming perfect
caches [1,4]. Almost the same accuracy is expected in our
case, because a VLIW processor can be considered similar to
the in-order issue processor used in [4] and [1].
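
The profile-weighted estimate amounts to summing, over all
scheduled regions, the static schedule length multiplied by the
number of times the region executes. The sketch below assumes the
region granularity, the dictionary layout, and the sample numbers.
None of these come from the paper.

```python
def estimated_cycles(schedule_length, exec_count):
    """Sum of static schedule length times profiled execution count.

    schedule_length: region name -> cycles in its static schedule
    exec_count:      region name -> times the region ran during profiling
    Regions missing from the profile are treated as never executed.
    """
    return sum(length * exec_count.get(region, 0)
               for region, length in schedule_length.items())


# Hypothetical numbers for illustration only.
lengths = {"B0": 4, "B1": 7}
counts = {"B0": 100, "B1": 10}
print(estimated_cycles(lengths, counts))  # 4*100 + 7*10 = 470
```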

Six SPEC95 integer programs are used as the benchmarks in
this experiment. Each benchmark program is compiled with our
compiler for the old machine, and the required code
annotations (the dependency word and other operation markings,
such as backward branches) are generated. Program traces
(using the train inputs of the benchmarks) are generated in a
suitable format and used by mlcache. Speedup is calculated as
the ratio of the number of cycles for basic block scheduling
to the number of cycles for hyperblock scheduling. The results
are generated with three different machine models, as shown in
Figure 9.

Figure 9: Issue slot configuration of three machine models.

Figure 10 shows the speedup achieved through moving the
older machine code to the new wider machine and applying the
DVG technique. This indicates that the DVG technique uses
additional resources in the new machine to improve the
performance of the code compiled for the old machine.

Figure 10: Speedup of the rescheduled old machine code
on a wider new machine model.

Results in Figure 11 show the performance when the code
compiled for the old machine is rescheduled through the DVG
method for the new machine, in comparison to code compiled
directly for the new machine. For example, in Figure 11 (c)
the M4 to M8 column indicates the reduction in speedup for
code originally compiled for M4 and rescheduled for M8,
compared to the speedup of code compiled for the M8 machine
model.

The results show that the performance of the DVG technique
is higher for wider-issue processors, because more scheduling
opportunities arise when more resources are available.

6 Conclusions 

In this paper, we have presented a novel approach to
overcome the problem of binary incompatibility in VLIW
machines. In this approach, called DVG, operations are
rescheduled for the new machine at the time of instruction
cache miss repair. The rescheduler hardware is therefore not
located in the execution pipeline, which avoids potentially
longer cycle times. Experimental results show that the
performance of rescheduled code using the DVG technique is
about 10% worse than code compiled directly for the target
processor.

Figure 11: Speedup of the rescheduled code for machine models (a) M4, (b) M6 and (c) M8.

The DVG technique was investigated in the context of
predicated code. Future work may include research on the
capability of the DVG for non-predicated code and evaluation
of the rescheduling overhead and its impact on performance.